\chapter{Nonlinear Least Squares Capabilities}\label{nls}
% TODO: discuss calibration overall, then NLS
\section{Overview}\label{nls:overview}
Any Dakota optimization algorithm can be applied to calibration
problems arising in parameter estimation, system identification, and
test/analysis reconciliation. However, nonlinear least-squares
methods are optimization algorithms that exploit the special structure
of a sum of the squares objective function~\cite{Gil81}.

To exploit the problem structure, more granularity is needed in the
response data than is required for a typical optimization problem.
That is, rather than using the sum-of-squares objective function and
its gradient, least-squares iterators require each term used in the
sum-of-squares formulation along with its gradient. This means that
the $m$ functions in the Dakota response data set consist of the
individual least-squares terms along with any nonlinear inequality and
equality constraints. These individual terms are often called
\emph{residuals} when they denote differences of observed quantities
from values computed by the model whose parameters are being
estimated.

The enhanced granularity needed for nonlinear least-squares algorithms
allows for simplified computation of an approximate Hessian matrix.
In Gauss-Newton-based methods for example, the true Hessian matrix is
approximated by neglecting terms in which residuals multiply Hessians
(matrices of second partial derivatives) of residuals, under the
assumption that the residuals tend towards zero at the solution. As a
result, residual function value and gradient information (first-order
information) is sufficient to define the value, gradient, and
approximate Hessian of the sum-of-squares objective function
(second-order information). See Section~\ref{nls:formulations} for
additional details on this approximation.

In practice, least-squares solvers will tend to be significantly more
efficient than general-purpose optimization algorithms when the
Hessian approximation is a good one, e.g., when the residuals tend
towards zero at the solution. Specifically, they can exhibit the
quadratic convergence rates of full Newton methods, even though only
first-order information is used. Gauss-Newton-based least-squares
solvers may experience difficulty when the residuals at the solution
are significant. Dakota has three solvers customized to take
advantage of the sum of squared residuals structure in this problem
formulation. Least squares solvers may experience difficulty when the
residuals at the solution are significant, although experience has
shown that Dakota's NL2SOL method can handle some problems that are
highly nonlinear and have nonzero residuals at the solution.

\section{Nonlinear Least Squares Fomulations}\label{nls:formulations}

Specialized least squares solution algorithms can exploit the
structure of a sum of the squares objective function for problems of
the form:

\begin{eqnarray}
  \hbox{minimize:} & & f(\mathbf{x}) =
  \sum_{i=1}^{n}[T_i(\mathbf{x})]^2\nonumber\\
  & & \mathbf{x} \in \Re^{p}\nonumber\\
  \hbox{subject to:} & &
  \mathbf{g}_L \leq \mathbf{g(x)} \leq \mathbf{g}_U\nonumber\\
  & & \mathbf{h(x)}=\mathbf{h}_{t}\label{nls:equation02}\\
  & & \mathbf{a}_L \leq \mathbf{A}_i\mathbf{x} \leq
  \mathbf{a}_U\nonumber\\
  & & \mathbf{A}_e\mathbf{x}=\mathbf{a}_{t}\nonumber\\
  & & \mathbf{x}_L \leq \mathbf{x} \leq \mathbf{x}_U\nonumber
\end{eqnarray}

where $f(\mathbf{x})$ is the objective function to be minimized and
$T_i(\mathbf{x})$ is the i$^{\mathrm{th}}$ least squares term. The
bound, linear, and nonlinear constraints are the same as described
previously for (\ref{opt:formulations:equation01}). Specialized least
squares algorithms are generally based on the Gauss-Newton
approximation. When differentiating $f(\mathbf{x})$ twice, terms of
$T_i(\mathbf{x})T''_i(\mathbf{x})$ and $[T'_i(\mathbf{x})]^{2}$
result. By assuming that the former term tends toward zero near the
solution since $T_i(\mathbf{x})$ tends toward zero, then the Hessian
matrix of second derivatives of $f(\mathbf{x})$ can be approximated
using only first derivatives of $T_i(\mathbf{x})$. As a result,
Gauss-Newton algorithms exhibit quadratic convergence rates near the
solution for those cases when the Hessian approximation is accurate,
i.e. the residuals tend towards zero at the solution. Thus, by
exploiting the structure of the problem, the second order convergence
characteristics of a full Newton algorithm can be obtained using only
first order information from the least squares terms.

A common example for $T_i(\mathbf{x})$ might be the difference
between experimental data and model predictions for a response
quantity at a particular location and/or time step, i.e.:

\begin{equation}
  T_i(\mathbf{x}) = R_i(\mathbf{x})-\overline{R_i}
  \label{nls:equation03}
\end{equation}

where $R_i(\mathbf{x})$ is the response quantity predicted by the
model and $\overline{R_i}$ is the corresponding experimental data.
In this case, $\mathbf{x}$ would have the meaning of model parameters
which are not precisely known and are being calibrated to match
available data. This class of problem is known by the terms parameter
estimation, system identification, model calibration, test/analysis
reconciliation, etc.

\section{Nonlinear Least Squares with Dakota}

In order to specify a least-squares problem, the responses section of
the Dakota input should be configured using
\texttt{calibration\_terms} (as opposed to
\texttt{objective\_functions} as for optimization). The
calibration terms refer to the residuals (differences between the
simulation model and the data). Note that Dakota expects the
residuals, not the squared residuals, and offers options for instead
returning the simulation output to Dakota together with a separate
\texttt{calibration\_data} file, from which residuals will be
calculated. Any
linear or nonlinear constraints are handled in an identical way to
that of optimization (see Section~\ref{opt:formulations}; note that
neither Gauss-Newton nor NLSSOL require any constraint augmentation
and NL2SOL supports neither linear nor nonlinear constraints).
Gradients of the least-squares terms and nonlinear constraints are
required and should be specified using either
\texttt{numerical\_gradients}, \texttt{analytic\_gradients}, or
\texttt{mixed\_gradients}. Since explicit second derivatives are not
used by the least-squares methods, the \texttt{no\_hessians}
specification should be used. Dakota's scaling options, described in
Section~\ref{opt:additional:scaling} can be used on least-squares
problems, using the \texttt{calibration\_term\_scales} keyword to
scale least-squares residuals, if desired.

\section{Solution Techniques}\label{nls:solution}

Nonlinear least-squares problems can be solved using the Gauss-Newton
algorithm, which leverages the full Newton method from OPT++, the
NLSSOL algorithm, which is closely related to NPSOL, or the NL2SOL
algorithm, which uses a secant-based algorithm. Details for each are
provided below.

\subsection{Gauss-Newton}\label{nls:solution:gauss}

Dakota's Gauss-Newton algorithm consists of combining an
implementation of the Gauss-Newton Hessian approximation (see
Section~\ref{nls:formulations}) with full Newton optimization
algorithms from the OPT++ package~\cite{MeOlHoWi07} (see
Section~\ref{opt:methods:gradient:constrained}). The exact objective
function value, exact objective function gradient, and the approximate
objective function Hessian are defined from the least squares term
values and gradients and are passed to the full-Newton optimizer from
the OPT++ software package. As for all of the Newton-based
optimization algorithms in OPT++, unconstrained, bound-constrained,
and generally-constrained problems are supported. However, for the
generally-constrained case, a derivative order mismatch exists in that
the nonlinear interior point full Newton algorithm will require
second-order information for the nonlinear constraints whereas the
Gauss-Newton approximation only requires first order information for
the least squares terms. License: LGPL.

This approach can be selected using
the \texttt{optpp\_g\_newton} method specification. An example
specification follows:
\begin{small}
\begin{verbatim}
    method,
          optpp_g_newton
            max_iterations = 50
            convergence_tolerance = 1e-4
            output debug
\end{verbatim}
\end{small}

Refer to the Dakota Reference Manual~\cite{RefMan} for more detail on the
input commands for the Gauss-Newton algorithm.

The Gauss-Newton algorithm is gradient-based and is best suited for
efficient navigation to a local least-squares solution in the vicinity
of the initial point. Global optima in multimodal design spaces may be
missed. Gauss-Newton supports bound, linear, and nonlinear
constraints. For the nonlinearly-constrained case, constraint Hessians
(required for full-Newton nonlinear interior point optimization
algorithms) are approximated using quasi-Newton secant updates. Thus,
both the objective and constraint Hessians are approximated using
first-order information.

\subsection{NLSSOL}\label{nls:solution:nlssol}

The NLSSOL algorithm is bundled with NPSOL. It uses an SQP-based
approach to solve generally-constrained nonlinear least-squares
problems. It periodically employs the Gauss-Newton Hessian
approximation to accelerate the search. Like the Gauss-Newton
algorithm of Section~\ref{nls:solution:gauss}, its derivative order is
balanced in that it requires only first-order information for the
least-squares terms and nonlinear constraints. License: commercial;
see NPSOL~\ref{opt:methods:gradient:constrained}.

This approach can be selected using the \texttt{nlssol\_sqp} method
specification. An example specification follows:
\begin{small}
\begin{verbatim}
    method,
          nlssol_sqp
            convergence_tolerance = 1e-8
\end{verbatim}
\end{small}

Refer to the Dakota Reference Manual~\cite{RefMan} for more detail on the
input commands for NLSSOL.

\subsection{NL2SOL}\label{nls:solution:nl2sol}

The NL2SOL algorithm~\cite{Den81} is a secant-based least-squares
algorithm that is $q$-superlinearly convergent. It adaptively chooses
between the Gauss-Newton Hessian approximation and this approximation
augmented by a correction term from a secant update. NL2SOL tends to
be more robust (than conventional Gauss-Newton approaches) for
nonlinear functions and ``large residual'' problems, i.e.,
least-squares problems for which the residuals do not tend towards
zero at the solution. License: publicly available.

\subsection{Additional Features}\label{nls:solution:future}

Dakota's tailored derivative-based least squares solvers (but not
general optimization solvers) output confidence intervals on estimated
parameters. The reported confidence intervals are univariate
(per-parameter), based on local linearization, and will contain the
true value of the parameters with 95\% confidence. Their calculation
essentially follows the exposition in ~\cite{Seb03} and is summarized
here.

Denote the variance estimate at the optimal calibrated parameters
$\hat{x}$ by
\begin{equation}
\hat{\sigma}^2 = \frac{1}{N_{dof}}\sum_{i=1}^{n} T_i(\hat{x})^2,
\end{equation}
where $T_i$ are the least squares terms (typically residuals)
discussed above and $N_{dof} = n - p$ denotes the number of degrees of
freedom (total residuals $n$ less the number of calibrated parameters
$p$). Let
\begin{equation}
J = \left[ \frac{\partial T(\hat{x})}{\partial x} \right]
\end{equation}
denote the $n \times p$ matrix of partial derivatives of the residuals
with respect to the calibrated paramters. Then the standard error $SE_i$
for calibrated parameter $x_i$ is given by
\begin{equation}
SE_i = \hat{\sigma} \sqrt{\left( J^T J \right)^{-1}_{ii} }.
\end{equation}
Using a Student's t-distribution with $N_{dof}$ degrees of freedom,
the 95\% confidence interval for each parameter is given by
\begin{equation}
  \hat{x}_i \pm t(0.975, N_{dof}) \cdot SE_i.
\end{equation}

In the case where estimated gradients are extremely inaccurate or the
model is very nonlinear, the confidence intervals reported are likely
inaccurate as well.  Further, confidence intervals cannot be
calculated when the number of least-squares terms is less than the
number of parameters to be estimated, when using vendor numerical
gradients, or where there are replicate experiments.  See~\cite{Vug07}
for more details about confidence intervals, and note that there are
alternative approaches such as Bonferroni confidence intervals and
joint confidence intervals based on linear approximations or F-tests.

Least squares calibration terms (responses) can be weighted.  When
observation error variance is provided alongside calibration data, its
inverse is applied to yield the typical variance-weighted least
squares formulation.  Alternately, the \texttt{calibration\_terms
  weights} specification can be used to weight the squared
residuals. (Neither set of weights are adjusted during calibration as
they would be in iteratively re-weighted least squares.)  When
response scaling is active, it is applied after error variance
weighting and before \texttt{weights} application.  The
\texttt{calibration\_terms} keyword documentation in the Dakota
Reference Manual~\cite{RefMan} has more detail about weighting and
scaling of the residual terms.

\section{Examples}\label{nls:examples}

Both the Rosenbrock and textbook example problems can be formulated as
nonlinear least-squares problems. Refer to Chapter~\ref{additional}
for more information on these formulations.
%Figure~\ref{nls:figure01}
%shows an excerpt from output for the Rosenbrock example solved by
%the Gauss-Newton method.
%
%\begin{figure}
%\begin{bigbox}
%\begin{small}
%\begin{verbatim}
%     Active response data for function evaluation 1:
%     Active set vector = { 3 3 } Deriv vars vector = { 1 2 }
%                          -4.4000000000e+00 least_sq_term_1
%                           2.2000000000e+00 least_sq_term_2
%      [  2.4000000000e+01  1.0000000000e+01 ] least_sq_term_1 gradient
%      [ -1.0000000000e+00  0.0000000000e+00 ] least_sq_term_2 gradient
%\end{verbatim}
%\end{small}
%\end{bigbox}
%\caption{Example of intermediate output from a Gauss-Newton method.}
%\label{nls:figure01}
%\end{figure}

Figure~\ref{nls:figure02} shows an excerpt from the output 
obtained when running NL2SOL on a five-dimensional problem. 
Note that the optimal parameter estimates are printed, 
followed by the residual norm and values of the individual 
residual terms, followed by the confidence intervals on the parameters. 

\begin{figure}
\begin{bigbox}
\begin{small}
\begin{verbatim}
<<<<< Iterator nl2sol completed.
<<<<< Function evaluation summary: 27 total (26 new, 1 duplicate)
<<<<< Best parameters          =
                      3.7541004764e-01 x1
                      1.9358463401e+00 x2
                     -1.4646865611e+00 x3
                      1.2867533504e-02 x4
                      2.2122702030e-02 x5
<<<<< Best residual norm =  7.3924926090e-03; 0.5 * norm^2 =  2.7324473487e-05
<<<<< Best residual terms      =
                     -2.5698266189e-03
                      4.4759880011e-03
                      9.9223430643e-04
                     -1.0634409194e-03

...

Confidence Interval for x1 is [  3.7116510206e-01,  3.7965499323e-01 ]
Confidence Interval for x2 is [  1.4845485507e+00,  2.3871441295e+00 ]
Confidence Interval for x3 is [ -1.9189348458e+00, -1.0104382765e+00 ]
Confidence Interval for x4 is [  1.1948590669e-02,  1.3786476338e-02 ]
Confidence Interval for x5 is [  2.0289951664e-02,  2.3955452397e-02 ]

\end{verbatim}
\end{small}
\end{bigbox}
\caption{Example of confidence intervals on optimal parameters}
\label{nls:figure02}
\end{figure}

The analysis driver script (the script being driven by Dakota) has to
perform several tasks in the case of parameter estimation using
nonlinear least-squares methods. The analysis driver script must: (1)
read in the values of the parameters supplied by Dakota; (2) run the
computer simulation with these parameter values; (3) retrieve the
results from the computer simulation; (4) compute the difference
between each computed simulation value and the corresponding
experimental or measured value; and (5) write these residuals
(differences) to an external file that gets passed back to
Dakota. Note there will be one line per residual term, specified with
\texttt{calibration\_terms} in the Dakota input file. It is the last
two steps which are different from most other Dakota applications.

To simplify specifying a least squares problem, one may provide Dakota a
data file containing experimental results or other calibration data.
In the case of scalar calibration terms, 
this file may be specified with \dakotakw{calibration_data_file}.  In
this case, Dakota will calculate the residuals (that is, the
simulation model results minus the experimental results), and the
user-provided script can omit this step: the script can just return
the simulation outputs of interest. An example of this can be found in
the file named
\path{dakota/share/dakota/examples/users/textbook_nls_datafile.in}.  In this
example, there are 3 residual terms. The data file of experimental
results associated with this example is
\path{textbook_nls_datafile.lsq.dat}.  These three values are
subtracted from the least-squares terms to produce residuals for the
nonlinear least-squares problem.  Note that the file may be annotated
(specified by \texttt{annotated}) or freeform (specified by
\texttt{freeform}). The number of experiments in the calibration data
file may be specified with \dakotakw{num_experiments}, with one row of
data per experiment. 
When multiple experiments are present, the total number of least 
squares terms will be the number of calibration terms times the number of 
experiments.

Finally, the calibration data file may contain additional 
information than just
the observed experimental responses. If the observed data has
measurement error associated with it, this can be specified in columns
of such error data after the response data.  The type of measurement 
error is specified by \texttt{variance\_type}.  For scalar 
calibration terms, the \texttt{variance\_type} can be either \texttt{none} 
(the user does not specify a measurement variance associated with each 
calibration term) or \texttt{scalar} (the user specifies one measurement variance 
per calibration term). 
For field calibration terms, the \texttt{variance\_type} can also 
be \texttt{diagonal} or \texttt{matrix}.  These are explained in 
more detail in the Reference manual.
Additionally, there is sometimes the
need to specify configuration variables.  These are often used in
Bayesian calibration analysis. These are specified as
\texttt{num\_config\_variables}. If the user specifies a positive
number of configuration variables, it is expected that they will occur
in the text file before the responses.

\section{Usage Guidelines}\label{nls:usage}

% TODO: borrow from and refer to opt.

Calibration problems can be transformed to general optimization problems 
where the objective is some type of aggregated error metric.  For 
example, the objective could be the sum of squared error terms. 
However, it also could be the mean of the absolute value of the error 
terms, the maximum difference between the simulation results and 
observational results, etc. In all of these cases, one can 
pose the calibration problem as an optimization problem that can be 
solved by any of Dakota's optimizers.  In this situation, when 
applying an general optimization solver to a calibration problem,
the guidelines in Table~\ref{opt:usage} still apply. 

In some cases, it will be better to use a nonlinear least-squares
method instead of a general optimizer to determine optimal parameter 
values which result in simulation responses that ``best fit'' the 
observational data.  Nonlinear least squares methods exploit the 
special structure of a sum of the squares objective function. They 
can be much more efficient than general optimizers.  However, 
these methods require the gradients of the function with 
respect to the parameters being calibrated.  If the model 
is not able to produce gradients, one can use finite differencing 
to obtain gradients.  However, the gradients must be reasonably 
accurate for the method to proceed. Note that the nonlinear 
least-squares methods only operate on a sum of squared errors 
as the objective. Also, the user must return each residual term 
separately to Dakota, whereas the user can return an aggregated 
error measure in the case of general optimizers.

The three nonlinear least-squares methods are the Gauss-Newton method in 
OPT++, NLSSOL, and NL2SOL.  Any of these may be tried;  
they give similar performance on many problems. 
NL2SOL tends to be more robust than Gauss-Newton, especially for nonlinear 
functions and large-residual problems where one is not able to drive 
the residuals to zero at the solution.  NLSSOL does require 
that the user has the NPSOL library.   
Note that all of these methods are local in the sense that they are 
gradient-based and depend on an initial starting point.  Often they 
are used in conjunction with a multi-start method, to perform 
several repetitions of the optimization at different starting 
points in the parameter space.  Another approach is to use 
a general global optimizer such as a genetic algorithm or 
DIRECT as mentioned above.  This can be 
much more expensive, however, in terms of the number of function 
evaluations required. 
